WineQualityReds by Katrin Haller

Short Introduction

In this Exploratory Data Analysis (EDA) I explore a dataset about the quality of red wine. The dataset contains 13 variables and 1599 observations. It includes information about the quality level, different acids and acidity measures, residual sugar, alcohol, density and pH level.

Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The two outputs above show the names of the 13 variables and the structure of our dataset.

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

This output shows the first 6 rows (the default) of the table, so I can get familiar with the values and columns.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The output above is the statistical summary of the dataset and gives a better idea of the values the variables can take. The mean quality is 5.6, with a range from 3 to 8. The mean alcohol level is 10.42, the mean pH level is 3.31 and the mean volatile acidity is 0.53.

Plot 1 & Plot 2

Both plots look slightly right-skewed, so it is a good idea to transform the data with log10.

After the log10 transformation, both distributions look approximately normal.

Plot 3 & Plot 4

The plots for citric.acid and residual.sugar also look very right-skewed, so I will transform the data with log10 as above.

After the transformation, plots p3 and p4 look approximately normally distributed.
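The report does not show its code, so the following is only a minimal Python sketch (standard library only) of why a log10 transformation helps with right-skewed data. It uses the first ten residual.sugar values from the structure output above; the helper name `skewness` is an assumption made for this illustration, not something taken from the original analysis.

```python
import math
import statistics


def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# First ten residual.sugar values, copied from the structure output above.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]

# log10 compresses large values more than small ones, pulling in the long right tail.
log_sugar = [math.log10(v) for v in residual_sugar]

skew_raw = skewness(residual_sugar)
skew_log = skewness(log_sugar)
print(f"skewness raw:   {skew_raw:.2f}")
print(f"skewness log10: {skew_log:.2f}")
```

A positive skewness indicates a longer right tail. Ten rows are far too few to judge the full distribution, so this only illustrates the direction of the effect. Note also that log10 is undefined for a value of 0, which matters for citric.acid because its minimum in the summary is 0.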

Plot 5 & Plot 6

As in plots p1 to p4, these two plots are right-skewed too and have to be transformed with log10.

Now both sulfur dioxide variables look approximately normally distributed.

Plot 7 & Plot 8

The plots for density and pH level both look normally distributed, so we don’t have to transform them.

Plot 9 & Plot 10

These are the plots for chlorides and sulphates, which are also right-skewed.

After the transformation, the plots for chlorides and sulphates look approximately normally distributed.

Plot 11 & Plot 12

The plot for alcohol looks right-skewed, so I will transform it with log10.

Create a new variable ‘wine_rating’

A new variable, ‘wine_rating’, was created from the variable ‘quality’ to rate the wines.

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality wine_rating
## 1       5      medium
## 2       5      medium
## 3       5      medium
## 4       6      medium
## 5       5      medium
## 6       5      medium

Here you can see that the new variable ‘wine_rating’ was created and added to the dataset.
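The cut points used for ‘wine_rating’ are not stated in the report; the output above only shows that quality 5 and 6 map to medium. As a minimal Python sketch, assuming (hypothetically) that quality up to 4 is low, 5 to 6 is medium and 7 or more is high, the binning could look like this:

```python
from collections import Counter


def wine_rating(quality):
    """Map a numeric quality score to a rating label.

    ASSUMPTION: the thresholds (<= 4 low, 5-6 medium, >= 7 high) are
    illustrative; the report only confirms that 5 and 6 are 'medium'.
    """
    if quality <= 4:
        return "low"
    if quality <= 6:
        return "medium"
    return "high"


# First ten quality values, copied from the structure output above.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

ratings = [wine_rating(q) for q in quality]
counts = Counter(ratings)
print(ratings)
print(counts)
```

Under these assumed thresholds, the first six rows all come out as medium, which matches the table above.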

We can now observe that alcohol levels cluster at the lower end of the range (the median is 10.2) and that quality ranges from 3 to 8, with peaks at 5 and 6.

Here we see the plot for the new variable ‘wine_rating’, which clearly shows that most red wines in this dataset were rated medium.

## [1] 0

There are no missing values in the dataset.

##  [1]  143  145  468  589  653  822 1115 1133 1229 1270 1271 1476 1478

The boxplot shows some outliers in the variable ‘alcohol’.
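The report does not say how the outliers were identified. A common convention for boxplots is to flag points more than 1.5 times the interquartile range (IQR) beyond the quartiles. The following Python sketch assumes that rule and applies it to the alcohol quartiles from the summary above; the function names are hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, lower, upper):
    """A value counts as an outlier if it lies outside the fences."""
    return value < lower or value > upper


# Quartiles of alcohol, copied from the summary output above.
alcohol_q1, alcohol_q3 = 9.5, 11.1
lower, upper = iqr_fences(alcohol_q1, alcohol_q3)

print(f"fences: {lower:.2f} to {upper:.2f}")
print("max 14.9 flagged:", is_outlier(14.9, lower, upper))
print("min 8.4 flagged:", is_outlier(8.4, lower, upper))
```

With these quartiles the fences come out at about 7.1 and 13.5, so the maximum alcohol value of 14.9 would be flagged, while the minimum of 8.4 would not.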

Univariate Analysis

What is the structure of your dataset?

The RedWineQuality dataset has 13 variables and 1599 observations. The variables cover various chemical properties of wine, such as volatile and fixed acidity, citric acid, chlorides, sulfur dioxide, sulphates, alcohol and residual sugar. The variables also include measurements like density, pH level and the quality of the wine. There are no missing values in the dataset.

What is/are the main feature(s) of interest in your dataset?

The most interesting features in this dataset are the quality of wine in relation to the level of alcohol, the pH level in relation to volatile acidity and residual sugar, the sulfur dioxide and, finally, the density.

Did you create any new variables from existing variables in the dataset?

To get a clearer view of the distribution of wine quality, I created a new variable called ‘wine_rating’ from ‘quality’. This variable groups the red wines into low, medium and high quality. The plot clearly shows that most red wines were rated medium quality.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

In my first outputs I inspected the variables and their values. After this I created plots for each variable to see the distributions more clearly. Most of the variables were right-skewed, so I transformed them with log10 to get an approximately normal distribution. I also found some outliers in volatile.acidity, citric.acid, residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, pH and alcohol.

Bivariate Plots Section

Prepare Data for Correlation Matrix

##   volatile.acidity citric.acid residual.sugar free.sulfur.dioxide density
## 1             0.70        0.00            1.9                  11  0.9978
## 2             0.88        0.00            2.6                  25  0.9968
## 3             0.76        0.04            2.3                  15  0.9970
## 4             0.28        0.56            1.9                  17  0.9980
## 5             0.70        0.00            1.9                  11  0.9978
## 6             0.66        0.00            1.8                  13  0.9978
##     pH alcohol quality
## 1 3.51     9.4       5
## 2 3.20     9.8       5
## 3 3.26     9.8       5
## 4 3.16     9.8       6
## 5 3.51     9.4       5
## 6 3.51     9.4       5

Correlation Matrix

##                     volatile.acidity citric.acid residual.sugar
## volatile.acidity                1.00       -0.55           0.00
## citric.acid                    -0.55        1.00           0.14
## residual.sugar                  0.00        0.14           1.00
## free.sulfur.dioxide            -0.01       -0.06           0.19
## density                         0.02        0.36           0.36
## pH                              0.23       -0.54          -0.09
##                     free.sulfur.dioxide density    pH alcohol quality
## volatile.acidity                  -0.01    0.02  0.23   -0.20   -0.39
## citric.acid                       -0.06    0.36 -0.54    0.11    0.23
## residual.sugar                     0.19    0.36 -0.09    0.04    0.01
## free.sulfur.dioxide                1.00   -0.02  0.07   -0.07   -0.05
## density                           -0.02    1.00 -0.34   -0.50   -0.17
## pH                                 0.07   -0.34  1.00    0.21   -0.06

Here we have our correlation matrix, which shows where the strongest correlations between variables are.
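The coefficients in such a matrix are presumably Pearson correlations, although the report does not name the method. As a minimal Python sketch under that assumption, the coefficient for one pair of variables can be computed by hand. The six rows used here are the ones printed in the prepared data above, so the result will not reproduce the full-data value of -0.55; it only shows the calculation and the sign.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# First six rows of the prepared data shown above.
volatile_acidity = [0.70, 0.88, 0.76, 0.28, 0.70, 0.66]
citric_acid = [0.00, 0.00, 0.04, 0.56, 0.00, 0.00]

r = pearson(volatile_acidity, citric_acid)
print(f"r (six rows only): {r:.2f}")
```

The coefficient is negative here as well: the row with the lowest volatile.acidity is also the one with the highest citric.acid.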

##                  Var1             Var2 value
## 1    volatile.acidity volatile.acidity  1.00
## 2         citric.acid volatile.acidity -0.55
## 3      residual.sugar volatile.acidity  0.00
## 4 free.sulfur.dioxide volatile.acidity -0.01
## 5             density volatile.acidity  0.02
## 6                  pH volatile.acidity  0.23

Heatmap of the strongest correlations in the dataset.

Here we have the boxplots for quality vs. alcohol and quality vs. residual.sugar.

Here we have the scatterplots for the relationship of the variables in the boxplot above.

Here we have the boxplots for pH vs. volatile.acidity and residual.sugar vs. pH.

Here we have the boxplots for residual.sugar vs. volatile.acidity and alcohol vs. pH.

These are the scatterplots showing the relationship between residual.sugar vs. volatile.acidity and alcohol vs. pH level.

Here we have the boxplots for density vs. residual.sugar and density vs. citric.acid.

These are the scatterplots for density vs. residual.sugar and vs. citric.acid.

Here we see the boxplots for the relationship of residual.sugar vs. free.sulfur.dioxide and for quality vs. citric.acid.

These are the corresponding scatterplot and boxplot for the boxplots above.

Extra Plot

We see a higher level of quality when volatile.acidity is lower. There are some outliers with a higher level of residual.sugar and also a higher level (>6) of citric.acid - this could be an interesting question for the level of quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

My main interest was in the relationships between (a) density vs. citric.acid or residual.sugar, (b) pH level vs. volatile.acidity and (c) free.sulfur.dioxide vs. residual.sugar.

For (a), density vs. residual.sugar shows a strong relationship with density around the mean residual.sugar of 2.5. This is interesting because the density is determined by the concentration of alcohol and sugar. This plot can be improved in the multivariate plots section. Density vs. citric.acid also shows a somewhat strong relationship.

The relationship for (b), pH level vs. volatile.acidity, is interesting because it separates tart wines from soft wines. In the plot we can observe that the pH level is mostly between 3.0 and 3.5/3.6 and the level of volatile.acidity between 0.2 and 0.8. This plot can also be improved in the multivariate section to see more relationships.

The next interesting relationship, (c), is between free.sulfur.dioxide and residual.sugar. In the plot we can see that at a low level of sugar (less than 4) the concentration of free.sulfur.dioxide varies widely (between around 5 and 40). The trend shows that higher levels of sugar come with higher values of free.sulfur.dioxide. This chemical compound is important for preserving the flavor after harvest and protects the wine from further fermentation, so the wine can be stored for many years.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, in the end I searched for interesting insights in other relationships between variables. For the relationship between residual.sugar and citric.acid I found that there are some outliers with a higher level of residual.sugar and also a higher level (>6) of citric.acid. This could be an interesting question for the level of quality. I also looked at the relationship between quality and volatile.acidity. Here we see a higher level of quality when volatile.acidity is lower. This might lead to the assumption that some softer wines are rated higher than tart wines.

What was the strongest relationship you found?

The strongest positive relationship from the correlation table is quality vs. alcohol (correlation coefficient of 0.48). It is followed by density vs. citric.acid (0.36) and density vs. residual.sugar (0.36). This is interesting because the density is determined by the concentration of alcohol and sugar, which might be important for the quality and wine_rating. By absolute value, the largest coefficient shown in the matrix above is the negative correlation between volatile.acidity and citric.acid (-0.55).

Multivariate Plots Section

Quality, alcohol and sugar

This plot shows the relationship between the variables alcohol, residual.sugar and quality, since quality vs. alcohol is supposed to be the strongest relationship. We see that the quality level tends to be higher with a higher level of alcohol.

This plot shows the relationship between alcohol, density and quality. We can observe that with a higher level of alcohol and a low level of residual.sugar we have the lowest level of density.

This plot shows the quality more clearly, because we used the created variable ‘wine_rating’. As you can see, there are some outliers with a medium rating, a lower level of alcohol and a high level of sugar. But the high-rated wines mostly have a level of residual.sugar lower than 8, mostly lower than 6, and a level of alcohol between 10 and 14.

Tart or soft wine?

Here we see the highest wine_ratings for pH levels mostly between 3.0 and 3.5 and a volatile.acidity level of less than 0.6. For higher levels of volatile.acidity we see a lower wine_rating.

Quality, sulfur dioxide and sugar

This plot shows the quality for red wines based on the relationship of residual.sugar and free.sulfur.dioxide.

This plot shows two different results. First, with a lower level of residual.sugar and free.sulfur.dioxide the wine_rating is more likely to be high or medium. But there are also values that show that with a slightly higher level of residual.sugar but a lower level of free.sulfur.dioxide the wine_rating is more likely to be high. We also see some outliers with a medium wine_rating that have a lot of residual.sugar, in some cases combined with high levels of free.sulfur.dioxide.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I continued to investigate the relationships I had in the bivariate analysis. In the multivariate analysis I added a further variable, mostly quality or wine_rating, to find out which wines had been rated highly and for what reason.

First I looked at the relationship between alcohol, residual.sugar, density and quality, and later wine_rating. Most interesting was that the quality level tends to be higher with a higher level of alcohol and a medium level of residual.sugar.

Next I took a look at the relationship between volatile.acidity and pH level. This one was really interesting because here you can see whether soft or tart wine is rated more highly. In this case wines with a lower volatile.acidity and a pH level of 3.0 to 3.5 are preferred.

Last I wanted to know more about the rating in relation to free.sulfur.dioxide and residual.sugar. Sulfur dioxide is added to protect the wine from further fermentation, so that it can be stored for years, and to preserve the flavor of the grapes after the harvest. This plot was not so clear, since it points in two possible directions. One result was that with a lower level of residual.sugar and free.sulfur.dioxide the wine_rating is more likely to be high or medium. The second result is that with a slightly higher level of residual.sugar but a lower level of free.sulfur.dioxide the wine_rating is also more likely to be high.

Were there any interesting or surprising interactions between features?

One thing I did not expect is that residual.sugar doesn’t seem to have much impact on the quality or wine_rating. I would also have expected some plots to show a clearer trend in one direction.


Final Plots and Summary

The RedWineQuality dataset was very interesting to explore. My final plots summarize the relationships for some of the variables with the strongest correlation as well as variables which seem to have an impact on the rating of wine quality. The trend is that red wines with a lower volatile.acidity are preferred. These wines tend to be softer and more balanced, which seems reasonable to me. Red wines usually have a pH level of 3.3-3.6, while white wines usually have a pH level of 3.0-3.4. Most of the red wines here have a pH level between 3.0 and 3.5. In combination with a low volatile.acidity these wines tend to be softer. Wines with a higher level of alcohol were also rated much higher. Overall, the wines were rated medium, with a tendency toward high.

Plot One

Description One

I chose the combination of these two plots because they show the level of alcohol and residual.sugar. The alcohol level is most often lower than 10 or 12, and the level of residual.sugar is also low, mostly lower than 4 on a scale of 10.

Plot Two

Description Two

This plot shows the quality levels against volatile.acidity: the quality tends to be higher for lower levels of volatile.acidity. This is interesting because it shows that more balanced wines are preferred.

Plot Three

Description Three

Finally, I chose this plot to show the relationship between volatile.acidity and the pH level, because it gives us some information about whether a wine is softer or more tart. We see that the highest ratings were for wines with a pH level between 3.0 and 3.5 and a level of volatile.acidity less than 0.8, mostly less than 0.6.


Reflection

The RedWineQuality dataset was very interesting to explore. Maybe it would have been helpful to have more variables, since some of the given variables didn’t seem to have much impact. But this was a first exploration, so a second and third one could follow from it. One of the struggles I had was finding a trend in the data, because some plots seemed to show more than one result, so it was not always easy to interpret the data for the next step. Other plots were straightforward. As expected, most wines fell into the medium wine_rating. Maybe this also reflects some kind of social desirability, as it was perhaps not always easy to determine whether one wine was better than another.

Some interesting future work could include a comparison between red and white wines. Are the ratings different overall, and which wine is preferred: white wine or red wine? Further interesting questions concern the region of the wine and which grapes were used. It would also be interesting to know the age and sex of the participants who rated the wine, and whether there is a preference for white or red wine by sex, age or both, as well as whether female and male participants drink wine often or just occasionally. It must also be considered that not everyone has a lot of experience in rating a wine and its taste or flavor.